pacman::p_load(tidyverse, sf, httr, jsonlite, rvest, tmap, leaflet, ggstatsplot, spdep, spgwr, olsrr, gtsummary, GWmodel, rsample, ranger, SpatialML)
Take-home Exercise 3
1 Overview
1.1 Objective
The goal of this analysis is to predict the resale prices of Housing and Development Board (HDB) flats in Singapore for the period of July to September 2024. Using data from 2023, we aim to build a predictive model that considers both structural and locational factors, capturing the unique spatial characteristics that influence HDB resale prices. This model will provide valuable insights for:
-
Potential buyers looking to make informed investment decisions
-
Real estate investors seeking accurate market forecasts
-
Policymakers aiming to understand spatial trends in housing affordability
1.2 Context
Housing in Singapore, particularly HDB flats, represents a critical component of household wealth and serves as a significant investment for most residents. Given Singapore’s compact urban environment, several factors play a role in determining HDB resale prices:
-
Locational Factors: Proximity to amenities like public transportation (MRT), shopping centers, and quality schools
-
Structural Factors: Attributes such as the flat’s size, age, and floor level
-
Macro-level Influences: Economic conditions and government policies impacting the housing market
Accurately predicting resale prices is essential not only for financial planning and investment but also for urban planning and policy-making to ensure sustainable housing affordability.
1.3 Methodology Overview
Traditional models like Ordinary Least Squares (OLS) regression have limitations when applied to spatial data, because they fit one set of coefficients for the whole study area and assume independent errors. They therefore ignore two properties of spatial data:
-
Spatial Heterogeneity: Relationships between housing prices and influencing factors vary across locations.
-
Spatial Autocorrelation: Housing prices in nearby areas tend to be similar, leading to clustering effects.
To address these challenges, we will employ Geographically Weighted Models (GWMs), specifically:
- Geographically Weighted Regression (GWR), which captures spatial variability in linear relationships.
By comparing the performance of OLS and GWR models, this study will demonstrate the effectiveness of spatially weighted approaches for real estate price prediction in Singapore.
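To make the contrast concrete, the sketch below fits a global OLS model and a geographically weighted fit at two locations on simulated data. It is a minimal illustration rather than the workflow used in this study: the simulated data, the `gwr_at()` helper, the Gaussian kernel, and the bandwidth of 2 are all assumptions chosen for the example.

```r
# Simulated data (assumption): the floor-area effect grows from west to east
set.seed(42)
n <- 500
toy <- data.frame(
  x_coord = runif(n, 0, 10),
  y_coord = runif(n, 0, 10),
  floor_area = runif(n, 60, 120)
)
toy$price <- 100 + (2 + toy$x_coord) * toy$floor_area + rnorm(n, sd = 5)

# Global OLS: a single slope for the whole study area
ols_fit <- lm(price ~ floor_area, data = toy)

# Hypothetical helper: weighted least squares centred on one location,
# with Gaussian kernel weights that decay with distance
gwr_at <- function(px, py, data, bandwidth = 2) {
  d <- sqrt((data$x_coord - px)^2 + (data$y_coord - py)^2)
  w <- exp(-0.5 * (d / bandwidth)^2)
  coef(lm(price ~ floor_area, data = data, weights = w))
}

west <- gwr_at(1, 5, toy)
east <- gwr_at(9, 5, toy)

print(coef(ols_fit))
print(rbind(west = west, east = east))
```

Because the slope was simulated to rise from west to east, the local fit near the eastern edge should report a larger `floor_area` coefficient than the one near the western edge, while OLS returns a single averaged slope.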
2 Data Upload and Initial Setup
2.1 Installing and launching the R packages
This exercise uses the following R packages:
- tidyverse: A collection of R packages (including dplyr, ggplot2, tidyr, and more) for data manipulation, visualization, and cleaning. It is essential for streamlined data handling and is widely used for data wrangling and efficient manipulation of data frames.
- sf (Simple Features): A package that provides a standard approach for handling spatial data, such as shapefiles and geographic coordinates, in R. It’s useful for transforming data into spatial formats and performing spatial operations.
- httr: Facilitates HTTP requests, enabling access to external APIs to fetch locational or additional data about amenities or other contextual factors that may influence housing prices.
- jsonlite: A package used for parsing JSON data, often encountered in web APIs. This package is useful for converting JSON data into R data structures, allowing for seamless integration of JSON-formatted locational or contextual data.
- rvest: Supports web scraping, making it easy to extract data from websites. This can be useful if additional information from web sources (such as lists of nearby amenities or environmental factors) is required for analysis.
- tmap: A powerful package for creating static and interactive thematic maps. It’s helpful for visualizing spatial patterns, clusters, and trends in housing prices or other variables across geographic areas.
- leaflet: A mapping package focused on interactive maps. It is useful for creating dynamic spatial visualizations, which can help communicate results effectively to stakeholders.
- ggstatsplot: An extension of ggplot2 for enhanced statistical visualizations, adding statistical information and context to graphs. It’s useful for presenting both spatial and non-spatial relationships within the dataset.
- spdep: Used for spatial dependency analysis, spdep provides tools for calculating spatial autocorrelation (e.g., Moran’s I) and creating spatial weights, essential for analyzing spatial relationships among housing prices or other spatial data points.
- spgwr: Implements Geographically Weighted Regression (GWR) in R. This is useful for local regression analyses that reveal spatial variations in relationships, such as the effect of locational and structural factors on housing prices.
- olsrr: A package for ordinary least squares (OLS) regression diagnostics, which can aid in assessing model assumptions, identifying influential observations, and evaluating model performance.
- gtsummary: Provides summary tables and statistics in a clean format, making it easy to generate quick overviews of data or model outputs. Useful for generating reports with organized statistical summaries.
- GWmodel: A specialized package for geographically weighted models, including geographically weighted regression (GWR), summary statistics, and principal components analysis, which capture spatially varying relationships in data.
- rsample: A package for creating resampling objects, which is useful for cross-validation and other validation strategies to assess model performance on different subsets of data.
- ranger: An efficient implementation of the Random Forest algorithm in R, which can handle large datasets and be applied in predictive modeling tasks, including spatial modeling when combined with SpatialML.
- SpatialML: Supports machine learning on spatial data, including geographically weighted random forests built on ranger, providing tools that are specifically designed to handle the unique characteristics of spatial data in predictive modeling.
3 Data Import and Preparation
3.1 Primary Dataset
HDB Resale Flat Prices: The primary dataset for this analysis is the HDB Resale Flat Prices dataset available from Data.gov.sg. This dataset provides information on resale transactions for HDB flats, including price, location, flat type, and structural details. Key fields in this dataset that will be used for the analysis include:
-
Flat Type: Differentiates between three-room, four-room, and five-room flats, which are the focus of the study.
-
Transaction Price: The resale price of each HDB flat, which is the dependent variable to be predicted.
-
Floor Area and Floor Level: Measures of the flat’s size and position, which are structural factors that impact value.
-
Remaining Lease: The number of years left on the lease, crucial for pricing as HDB flats depreciate over time.
resale <- read_csv("data/HDB/rawdata/resale.csv") %>%
filter(
month >= "2023-01" & month <= "2023-12",
flat_type %in% c("3 ROOM", "4 ROOM", "5 ROOM")
)
Rows: 192234 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): month, town, flat_type, block, street_name, storey_range, flat_mode...
dbl (3): floor_area_sqm, lease_commence_date, resale_price
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The regional classification data for Singapore was sourced from Wikipedia. It divides Singapore into five regions (North, North-East, East, Central, and West) and assigns each town or planning area to one of them. Joining this classification to the primary dataset by town name gives the analysis a consistent regional framework, so that spatial heterogeneity can be examined and prices compared across regions.
region <- read_csv("data/HDB/rawdata/region.csv")
Rows: 55 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): town, region
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Define a helper function to capitalize the first letter of each word
capitalize_words <- function(x) {
sapply(strsplit(x, " "), function(words) {
paste(toupper(substring(words, 1, 1)), tolower(substring(words, 2)), sep = "", collapse = " ")
})
}
# Apply the helper function to standardize the `town` column in `region`
region <- region %>%
mutate(town = capitalize_words(town))
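As a quick check of what this helper returns, the snippet below redefines it and applies it to a few town names written in upper case. The input values are illustrative. Because the function splits on spaces only, a name containing a slash keeps its second part in lower case, which is why the manual region assignment in the next pipeline matches on "Kallang/whampoa".

```r
# Same helper as above, repeated so this snippet runs on its own
capitalize_words <- function(x) {
  sapply(strsplit(x, " "), function(words) {
    paste(toupper(substring(words, 1, 1)), tolower(substring(words, 2)),
          sep = "", collapse = " ")
  })
}

# Illustrative inputs in the upper-case style of the resale data
towns <- c("ANG MO KIO", "CENTRAL AREA", "KALLANG/WHAMPOA")
standardized <- unname(capitalize_words(towns))
print(standardized)
```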
resale_tidy <- resale %>%
# Create a new `address` column by combining `block` and `street_name`
mutate(address = paste(block, street_name)) %>%
# Extract the first two characters of `remaining_lease` as years and convert to integer
mutate(remaining_lease_yr = as.integer(
str_sub(remaining_lease, 1, 2))) %>%
# Extract characters from position 9 to 11 of `remaining_lease` as months and convert to integer , default to 0 if missing
mutate(remaining_lease_mth = if_else(is.na(as.integer(str_sub(remaining_lease, 9, 11))),
0,
as.integer(str_sub(remaining_lease, 9, 11)))) %>%
# Apply the helper function to standardize the `town` column in `resale_tidy` dataset
mutate(town = capitalize_words(town)) %>%
# Perform the join
left_join(region, by = "town") %>%
# Manually assign "Central" region to specific towns after the join
mutate(region = ifelse(town %in% c("Central Area", "Kallang/whampoa"), "Central", region)) %>%
# Calculate total remaining lease in months by converting years to months and adding the extracted months
mutate(
remaining_lease_total_mths = (remaining_lease_yr * 12) + remaining_lease_mth
) %>%
# Remove the intermediate columns `remaining_lease_yr` and `remaining_lease_mth`
select(-remaining_lease_yr, -remaining_lease_mth) %>%
# Split `storey_range` into minimum and maximum storey columns
# Extract the first two characters of `storey_range` as the minimum storey level
mutate(storey_min = as.integer(str_sub(storey_range, 1, 2))) %>%
# Extract the last two characters of `storey_range` as the maximum storey level
mutate(storey_max = as.integer(str_sub(storey_range, 7, 8))) %>%
# Calculate the average storey level by averaging `storey_min` and `storey_max`
mutate(
storey_avg = (storey_min + storey_max) / 2
)
write_rds(resale_tidy, "data/HDB/rds/resale_tidy.rds")
resale_tidy <- read_rds("data/HDB/rds/resale_tidy.rds")
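The fixed character positions used above depend on every `remaining_lease` value having the same layout. A pattern-based alternative is sketched below in base R. It assumes the values look like "61 years 04 months" or "95 years"; the `parse_lease_months()` name and the sample strings are hypothetical.

```r
# Hypothetical helper: convert a lease string to total months using patterns
parse_lease_months <- function(x) {
  # Leading number followed by "year" or "years"
  years <- as.integer(sub("^\\s*(\\d+)\\s+years?.*$", "\\1", x, perl = TRUE))

  # Months default to 0 when the string has no month part
  months <- rep(0L, length(x))
  has_months <- grepl("\\d+\\s+months?", x, perl = TRUE)
  months[has_months] <- as.integer(
    sub("^.*?(\\d+)\\s+months?.*$", "\\1", x[has_months], perl = TRUE)
  )

  years * 12L + months
}

# Sample values (assumed format)
lease_months <- parse_lease_months(c("61 years 04 months", "95 years", "54 years 1 month"))
print(lease_months)
```

The same idea could be applied to `storey_range` if its layout ever departs from the two-digit "01 TO 03" form that the position-based extraction relies on.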
add_list <- sort(unique(resale_tidy$address))
get_coords <- function(add_list){
# Create a data frame to store all retrieved coordinates
postal_coords <- data.frame()
for (i in add_list){
#print(i)
r <- GET('https://www.onemap.gov.sg/api/common/elastic/search?',
query=list(searchVal=i,
returnGeom='Y',
getAddrDetails='Y'))
data <- fromJSON(rawToChar(r$content))
found <- data$found
res <- data$results
# Create a new data frame for each address
new_row <- data.frame()
# If single result, append
if (found == 1){
postal <- res$POSTAL
lat <- res$LATITUDE
lng <- res$LONGITUDE
new_row <- data.frame(address= i,
postal = postal,
latitude = lat,
longitude = lng)
}
# If multiple results, drop NIL and append top 1
else if (found > 1){
# Remove those with NIL as postal
res_sub <- res[res$POSTAL != "NIL", ]
# Set as NA first if no Postal
if (nrow(res_sub) == 0) {
new_row <- data.frame(address= i,
postal = NA,
latitude = NA,
longitude = NA)
}
else{
top1 <- head(res_sub, n = 1)
postal <- top1$POSTAL
lat <- top1$LATITUDE
lng <- top1$LONGITUDE
new_row <- data.frame(address= i,
postal = postal,
latitude = lat,
longitude = lng)
}
}
else {
new_row <- data.frame(address= i,
postal = NA,
latitude = NA,
longitude = NA)
}
# Add the row
postal_coords <- rbind(postal_coords, new_row)
}
return(postal_coords)
}
coords <- get_coords(add_list)
write_rds(coords, "data/HDB/rds/coords.rds")
coords <- read_rds("data/HDB/rds/coords.rds")
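The selection logic inside `get_coords()` can be separated from the network call, which makes it easier to test without querying OneMap. The sketch below is one possible arrangement, not the function used above: `pick_geocode()` is a hypothetical helper, the mock response values are made up, and unlike the original it also drops a "NIL" postal code when only one result is returned.

```r
# Hypothetical helper: choose one result from an already parsed search response
pick_geocode <- function(address, found, results) {
  empty <- data.frame(address = address, postal = NA_character_,
                      latitude = NA_real_, longitude = NA_real_)
  if (is.null(found) || found == 0 || is.null(results) || nrow(results) == 0) {
    return(empty)
  }
  # Drop results without a postal code, then keep the first remaining one
  valid <- results[results$POSTAL != "NIL", , drop = FALSE]
  if (nrow(valid) == 0) {
    return(empty)
  }
  top <- valid[1, ]
  data.frame(address = address,
             postal = top$POSTAL,
             latitude = as.numeric(top$LATITUDE),
             longitude = as.numeric(top$LONGITUDE))
}

# Mock response with made-up values, shaped like the fields used above
mock <- data.frame(POSTAL = c("NIL", "560406"),
                   LATITUDE = c("1.3600", "1.3620"),
                   LONGITUDE = c("103.8500", "103.8538"),
                   stringsAsFactors = FALSE)

hit <- pick_geocode("406 ANG MO KIO AVE 10", 2, mock)
miss <- pick_geocode("NO SUCH ADDRESS", 0, NULL)
print(rbind(hit, miss))
```

Collecting one small data frame per address in a list and binding them once with `do.call(rbind, ...)` also avoids growing `postal_coords` row by row inside the loop.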
# Ensure that both address columns are in uppercase and have consistent formatting
resale_tidy <- resale_tidy %>%
mutate(address = toupper(address))
coords <- coords %>%
mutate(address = toupper(address))
# Perform the join by address column
resale_combined <- resale_tidy %>%
left_join(coords, by = "address")
# Select and arrange columns in the specified order
resale_final <- resale_combined %>%
select(resale_price, town, region, flat_type, flat_model, floor_area_sqm, storey_avg, remaining_lease_total_mths, latitude, longitude)
# Display the final data
head(resale_final)
# A tibble: 6 × 10
resale_price town region flat_type flat_model floor_area_sqm storey_avg
<dbl> <chr> <chr> <chr> <chr> <dbl> <dbl>
1 380000 Ang Mo Kio North-… 3 ROOM New Gener… 67 5
2 635000 Ang Mo Kio North-… 3 ROOM Model A 70 26
3 380000 Ang Mo Kio North-… 3 ROOM New Gener… 67 8
4 365000 Ang Mo Kio North-… 3 ROOM New Gener… 73 5
5 418000 Ang Mo Kio North-… 3 ROOM New Gener… 73 8
6 380000 Ang Mo Kio North-… 3 ROOM New Gener… 67 5
# ℹ 3 more variables: remaining_lease_total_mths <dbl>, latitude <chr>,
# longitude <chr>
# Convert resale_final to an sf object if it has latitude and longitude columns
resale_final <- resale_final %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326) # Set the original CRS (e.g., WGS84)
# Transform to the desired CRS (e.g., Singapore's SVY21 CRS)
resale_final <- st_transform(resale_final, crs = 3414)
3.2 Secondary Data — Locational Data
To enhance the predictive model, we incorporate secondary data sources to capture locational factors that influence HDB resale prices. These factors typically require geographic data about the proximity to amenities, which can be collected from multiple sources:
-
Public Transportation (MRT Stations):
- Data Source: LTA MRT Station Exit (GEOJSON) dataset from Data.gov.sg, provided by the Land Transport Authority (LTA).
- Purpose: This dataset provides the geographical coordinates of each MRT station exit. Using this data will allow us to calculate the precise distance from HDB flats to the nearest MRT exit, giving a more accurate measure of accessibility than a central station location would.
- Usage in Analysis: By incorporating MRT station exits instead of just station locations, we can improve the precision of our proximity calculations. Proximity to MRT stations is a critical factor influencing HDB resale prices, as flats closer to MRT exits are generally more attractive to buyers due to the ease of access to public transport.
- By using MRT station exit data, we calculate the straight-line (Euclidean) distance from each HDB flat to the nearest MRT exit, which serves as a proxy for walking distance. This enhances the locational data quality and potentially improves the predictive power of the model for HDB resale prices.
mrt <- st_read("data/Locational/rawdata/mrt.geojson")
mrt <- st_transform(mrt, crs = 3414)
# Calculate the straight-line distance to the nearest MRT station exit
nearest_mrt <- st_nearest_feature(resale_final, mrt)
resale_final <- resale_final %>%
mutate(proximity_to_mrt = st_distance(resale_final, mrt[nearest_mrt, ], by_element = TRUE))%>%
mutate(proximity_to_mrt = as.numeric(proximity_to_mrt) / 1000) # Convert meters to km
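For point data in a projected CRS such as SVY21, the nearest-feature lookup followed by an element-wise distance amounts to taking the smallest Euclidean distance per flat. The base R sketch below reproduces that calculation on made-up coordinates in metres. `nearest_km()` and both toy data frames are hypothetical and stand in for the sf objects used above.

```r
# Toy projected coordinates in metres (made-up values)
flats <- data.frame(id = c("A", "B"), x = c(0, 1000), y = c(0, 0))
exits <- data.frame(x = c(300, 900, 5000), y = c(400, 0, 0))

# Hypothetical helper: index of and distance (km) to the nearest target point
nearest_km <- function(from, to) {
  dx <- outer(from$x, to$x, "-")
  dy <- outer(from$y, to$y, "-")
  d <- sqrt(dx^2 + dy^2)  # rows = origins, columns = targets
  data.frame(
    nearest = apply(d, 1, which.min),
    km = apply(d, 1, min) / 1000
  )
}

result <- nearest_km(flats, exits)
print(cbind(flats["id"], result))
```

The same pattern is repeated for every proximity variable in this section, with only the target layer changing.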
-
Good Primary Schools:
-
Data Source: Ministry of Education’s data on schools (from the CSV file provided) and additional information from Math Nuggets - Primary School Rankings 2024.
Primary School Rankings 2024, extracted from: https://mathnuggets.sg/best-primary-schools-in-singapore/
| Ranking | School |
|---|---|
| 1 | Methodist Girls’ School (Primary) |
| 2 | Tao Nan School |
| 3 | Ai Tong School |
| 4 | Holy Innocents’ Primary School |
| 5 | CHIJ St. Nicholas Girls’ School (Primary) |
| 6 | Admiralty Primary School |
| 7 | St. Joseph’s Institution Junior |
| 8 | Catholic High School (Primary) |
| 9 | Anglo-Chinese School (Junior) |
| 10 | Chongfu School |
| 11 | Kong Hwa School |
| 12 | St. Hilda’s Primary School |
| 13 | Anglo-Chinese School (Primary) |
| 14 | Nan Chiau Primary School |
| 15 | Nan Hua Primary School |
| 16 | Nanyang Primary School |
| 17 | Pei Hwa Presbyterian Primary School |
| 18 | Kuo Chuan Presbyterian Primary School |
| 19 | Rulang Primary School |
| 20 | Singapore Chinese Girls’ Primary School |
-
Purpose: Proximity to the top 20 primary schools often increases property values due to the demand for accessible quality education. This analysis focuses on the top primary schools as per the 2024 rankings listed above. Only these schools are kept in the dataset, and their names are written in full capital letters to match the school_name field.
# Load the CSV file with school information
school_data <- read.csv("data/Locational/rawdata/Generalinformationofschools.csv")
# Define a list of top 20 primary schools in uppercase
top_schools <- c(
"METHODIST GIRLS' SCHOOL (PRIMARY)",
"TAO NAN SCHOOL",
"AI TONG SCHOOL",
"HOLY INNOCENTS' PRIMARY SCHOOL",
"CHIJ ST. NICHOLAS GIRLS' SCHOOL",
"ADMIRALTY PRIMARY SCHOOL",
"ST. JOSEPH'S INSTITUTION JUNIOR",
"CATHOLIC HIGH SCHOOL",
"ANGLO-CHINESE SCHOOL (JUNIOR)",
"CHONGFU SCHOOL",
"KONG HWA SCHOOL",
"ST. HILDA'S PRIMARY SCHOOL",
"ANGLO-CHINESE SCHOOL (PRIMARY)",
"NAN CHIAU PRIMARY SCHOOL",
"NAN HUA PRIMARY SCHOOL",
"NANYANG PRIMARY SCHOOL",
"PEI HWA PRESBYTERIAN PRIMARY SCHOOL",
"KUO CHUAN PRESBYTERIAN PRIMARY SCHOOL",
"RULANG PRIMARY SCHOOL",
"SINGAPORE CHINESE GIRLS' PRIMARY SCHOOL"
)
# Filter the CSV data to keep only rows with top primary schools
filtered_school_data <- school_data %>%
filter(school_name %in% top_schools) %>%
select(school_name, postal_code)
# Prepare list of postal codes
postal_codes <- filtered_school_data$postal_code
# Use the get_coords function to retrieve coordinates
school_coords <- get_coords(postal_codes)
# Convert postal_code to character in filtered_school_data
filtered_school_data <- filtered_school_data %>%
mutate(postal_code = as.character(postal_code))
# Merge the coordinates back with the filtered school data
final_school_data <- filtered_school_data %>%
left_join(school_coords, by = c("postal_code" = "postal"))
# Display final dataset
head(final_school_data)
# Convert final_school_data to an sf object if it has latitude and longitude columns
final_school_data <- final_school_data %>%
st_as_sf(coords = c("longitude", "latitude"), crs = 4326) # Set the original CRS (e.g., WGS84)
# Transform to the desired CRS (e.g., Singapore's SVY21 CRS)
final_school_data <- st_transform(final_school_data, crs = 3414)
# Calculate nearest distance to good school
nearest_goodprisch <- st_nearest_feature(resale_final, final_school_data)
resale_final <- resale_final %>%
mutate(proximity_to_goodprisch = st_distance(resale_final, final_school_data[nearest_goodprisch, ], by_element = TRUE))%>%
mutate(proximity_to_goodprisch = as.numeric(proximity_to_goodprisch) / 1000) # Convert meters to km
#Create a buffer of 1km around each resale flat
buffer_1km <- st_buffer(resale_final,
dist = 1000)
#Plot the newly created buffers and each good primary school
tmap_mode("view")
tm_shape(buffer_1km) +
tm_polygons() +
tm_shape(final_school_data) +
tm_dots()
#Count the number of good primary schools
resale_final$within_1km_prisch <- lengths(
st_intersects(buffer_1km, final_school_data))
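When the amenities are points, counting how many fall inside a circular buffer is the same as counting how many lie within the buffer radius of the flat. The base R sketch below shows that equivalence on made-up projected coordinates in metres. `count_within()` is a hypothetical helper, and results can differ marginally from the buffer approach because a buffer polygon only approximates a circle.

```r
# Hypothetical helper: number of target points within radius_m of each origin
count_within <- function(from, to, radius_m) {
  d <- sqrt(outer(from$x, to$x, "-")^2 + outer(from$y, to$y, "-")^2)
  rowSums(d <= radius_m)
}

# Toy data in metres (made-up values)
flats <- data.frame(x = c(0, 3000), y = c(0, 0))
schools <- data.frame(x = c(600, 0, 2500, 9000), y = c(0, 1000, 0, 0))

counts <- count_within(flats, schools, radius_m = 1000)
print(counts)
```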
-
Healthcare and Eldercare Facilities:
-
Data Source: dataset from Data.gov.sg, provided by the Ministry of Health (MOH).
-
Purpose: Accessibility to healthcare facilities can be a factor, especially for buyers looking for long-term residence or properties suitable for elderly family members.
-
eldercare <- st_read(dsn = "data/Locational/rawdata",
layer = "ELDERCARE") %>%
st_transform(crs = 3414)
# Calculate nearest distance to eldercare
nearest_eldercare <- st_nearest_feature(resale_final, eldercare)
resale_final <- resale_final %>%
mutate(proximity_to_eldercare = st_distance(resale_final, eldercare[nearest_eldercare, ], by_element = TRUE))%>%
mutate(proximity_to_eldercare = as.numeric(proximity_to_eldercare) / 1000) # Convert meters to km
CHAS <- st_read("data/Locational/rawdata/CHAS Clinics.kml") %>%
st_transform(crs = 3414)
# Calculate nearest distance to CHAS clinic
nearest_CHAS <- st_nearest_feature(resale_final, CHAS)
resale_final <- resale_final %>%
mutate(proximity_to_CHAS = st_distance(resale_final, CHAS[nearest_CHAS, ], by_element = TRUE))%>%
mutate(proximity_to_CHAS = as.numeric(proximity_to_CHAS) / 1000) # Convert meters to km
-
Supermarkets:
- Data Source: dataset from Data.gov.sg, provided by the Singapore Food Agency (SFA).
- Purpose: Supermarkets play a crucial role in daily life by providing access to essential groceries and household items. Proximity to a supermarket is highly desirable for residents, especially for families. As such, flats located near supermarkets are generally more attractive, as they offer added convenience for everyday needs, positively impacting resale prices.
spmrkt <- st_read("data/Locational/rawdata/Supermarket.geojson")
summary(spmrkt)
spmrkt <- st_transform(spmrkt, crs = 3414)
# Calculate nearest distance to supermarket
nearest_spmrkt <- st_nearest_feature(resale_final, spmrkt)
resale_final <- resale_final %>%
mutate(proximity_to_spmrkt = st_distance(resale_final, spmrkt[nearest_spmrkt, ], by_element = TRUE))%>%
mutate(proximity_to_spmrkt = as.numeric(proximity_to_spmrkt) / 1000) # Convert meters to km
-
Food Amenities:
- Data Source: dataset from Data.gov.sg, provided by the National Environment Agency (NEA).
- Purpose: Access to diverse food options is highly valued in Singapore. Flats located near popular hawker centers are likely to have higher resale values, as they offer residents convenient access to a variety of affordable dining options, enhancing the lifestyle appeal of the location.
hawker <- st_read("data/Locational/rawdata/Hawker.geojson")
summary(hawker)
hawker <- st_transform(hawker, crs = 3414)
# Calculate nearest distance to hawker
nearest_hawker <- st_nearest_feature(resale_final, hawker)
resale_final <- resale_final %>%
mutate(proximity_to_hawker = st_distance(resale_final, hawker[nearest_hawker, ], by_element = TRUE))%>%
mutate(proximity_to_hawker = as.numeric(proximity_to_hawker) / 1000) # Convert meters to km
-
Parks and Nature Reserves:
-
Data Source: dataset from Data.gov.sg, provided by the National Parks Board (NPARKS).
-
Purpose: Proximity to parks and nature reserves can increase property appeal for families and nature enthusiasts.
-
parks <- st_read("data/Locational/rawdata/Parks.kml") %>%
st_transform(crs = 3414)
# Calculate nearest distance to parks and nature reserves
nearest_parks <- st_nearest_feature(resale_final, parks)
resale_final <- resale_final %>%
mutate(proximity_to_parks = st_distance(resale_final, parks[nearest_parks, ], by_element = TRUE))%>%
mutate(proximity_to_parks = as.numeric(proximity_to_parks) / 1000) # Convert meters to km
-
Childcare Facilities:
-
Data Source: Childcare center locations from Data.gov.sg or private datasets if available.
-
Purpose: The presence of childcare facilities nearby is valuable for young families, potentially influencing their interest in the property.
-
childcare <- st_read("data/Locational/rawdata/Childcare.geojson") %>%
st_transform(crs = 3414)
# Calculate nearest distance to childcare
nearest_childcare <- st_nearest_feature(resale_final, childcare)
resale_final <- resale_final %>%
mutate(proximity_to_childcare = st_distance(resale_final, childcare[nearest_childcare, ], by_element = TRUE))%>%
mutate(proximity_to_childcare = as.numeric(proximity_to_childcare) / 1000) # Convert meters to km
#Create a buffer of 350m around each resale flat
buffer_350m <- st_buffer(resale_final,
dist = 350)
#Plot the newly created buffers and each childcare
tmap_mode("view")
tm_shape(buffer_350m) +
tm_polygons() +
tm_shape(childcare) +
tm_dots()
#Count the number of childcare
resale_final$within_350m_childcare <- lengths(
st_intersects(buffer_350m, childcare))
-
Public Transportation Accessibility:
-
Bus Stops: Data on bus stop locations within 350m or 1km from the property, sourced from LTA’s DataMall
-
Purpose: Ease of access to public transportation increases a property’s attractiveness and can be a significant factor in resale value.
-
busstop <- st_read(dsn = "data/Locational/rawdata",
layer = "BusStop") %>%
st_transform(crs = 3414)
# Calculate nearest distance to bus stop
nearest_busstop <- st_nearest_feature(resale_final, busstop)
resale_final <- resale_final %>%
mutate(proximity_to_busstop = st_distance(resale_final, busstop[nearest_busstop, ], by_element = TRUE))%>%
mutate(proximity_to_busstop = as.numeric(proximity_to_busstop) / 1000) # Convert meters to km
#Plot the newly created buffers and each bus stops
tmap_mode("view")
tm_shape(buffer_350m) +
tm_polygons() +
tm_shape(busstop) +
tm_dots()
#Count the number of bus stops
resale_final$within_350m_busstop <- lengths(
st_intersects(buffer_350m, busstop))
-
Shopping Amenities:
- Data Source: Mall names are scraped from Wikipedia’s list of shopping malls in Singapore and then geocoded with the OneMap search API.
- Purpose: Shopping malls provide access to a wide range of retail options, entertainment, and services, which enhance the appeal of nearby properties. Flats close to malls are often in higher demand due to the convenience of having various amenities within reach, potentially increasing resale values.
# Define the URL of the Wikipedia page
url <- "https://en.wikipedia.org/wiki/List_of_shopping_malls_in_Singapore"
# Read the HTML content of the page
page <- read_html(url)
# Extract mall names from list items across all sections
mall <- page %>%
html_nodes("ul > li") %>% # Selects all list items that are direct children of unordered lists
html_text()
# Filter mall names to exclude non-mall related content, if needed
mall <- mall[mall != ""]
# Remove the "[1]" notation from Knightsbridge or similar entries
mall <- gsub("\\[1\\]", "", mall)
# Keep only elements 50 to 223 (these positions depend on the page layout at scrape time)
mall <- mall[50:223]
# Make specific replacements in the extracted range
mall[42] <- gsub("GRiD\\(pomo\\)", "GR.ID", mall[42]) # Change "GRiD(pomo)" to "GR.ID"
mall[85] <- gsub("Paya Lebar Quarter \\(PLQ\\)", "Paya Lebar Quarter", mall[85]) # Change "Paya Lebar Quarter (PLQ)" to "Paya Lebar Quarter"
# Print extracted and modified mall names for the specified range
print(mall)
# Function to get SVY21 X/Y coordinates from the OneMap API
get_coordinates <- function(mall_name) {
base_url <- "https://www.onemap.gov.sg/api/common/elastic/search?"
response <- GET(base_url, query = list(searchVal = mall_name, returnGeom = "Y", getAddrDetails = "N"))
data <- content(response, "parsed")
if (length(data$results) > 0) {
result <- data$results[[1]]
# The API returns coordinates as strings, so convert them to numeric
return(as.numeric(c(result$X, result$Y)))
} else {
return(c(NA_real_, NA_real_))
}
}
# Initialize vectors to store coordinates
x_coords <- numeric(length(mall))
y_coords <- numeric(length(mall))
# Loop through each mall name to get coordinates
for (i in seq_along(mall)) {
coords <- get_coordinates(mall[i])
x_coords[i] <- coords[1]
y_coords[i] <- coords[2]
Sys.sleep(1) # Pause to respect API rate limits
}
# Combine mall names and their SVY21 X/Y coordinates into a data frame
mall_coord <- data.frame(
Mall_Name = mall,
x = x_coords,
y = y_coords,
stringsAsFactors = FALSE
)
# Print the data frame to check the results
print(mall_coord)
# Convert to an sf object; the OneMap X/Y values are already in EPSG:3414 (SVY21)
mall_coord <- mall_coord %>%
filter(!is.na(x), !is.na(y)) %>% # st_as_sf() rejects missing coordinates
st_as_sf(coords = c("x", "y"), crs = 3414)
# Print the updated mall_coord to view the point geometries (in metres)
print(mall_coord)
write_rds(mall_coord, "data/HDB/rds/mall_coord.rds")
mall_coord <- read_rds("data/HDB/rds/mall_coord.rds")
# Plot using tmap
tmap_mode("view")
tm_shape(mall_coord) +
tm_dots(col = "blue", size = 0.1, alpha = 0.8) +
tm_basemap("OpenStreetMap")
# Calculate nearest distance to shopping mall
nearest_mall <- st_nearest_feature(resale_final, mall_coord)
resale_final <- resale_final %>%
mutate(proximity_to_mall = st_distance(resale_final, mall_coord[nearest_mall, ], by_element = TRUE))%>%
mutate(proximity_to_mall = as.numeric(proximity_to_mall) / 1000) # Convert meters to km
write_rds(resale_final, "data/HDB/rds/resale_final.rds")
resale_final <- read_rds("data/HDB/rds/resale_final.rds")
Each of these secondary data sources will be processed to derive proximity variables (e.g., distance to nearest MRT station, number of schools within 1km) for inclusion in the predictive model. This helps capture the spatial elements that influence property prices beyond the structural features of the flats themselves.
3.3 Spatial Autocorrelation Check
The purpose of checking for spatial autocorrelation is to determine whether HDB resale prices exhibit spatial dependency. If there is a significant spatial autocorrelation, it means that the prices of nearby HDB flats tend to be similar. This spatial dependency justifies the use of geographically weighted models, which are designed to handle location-based variations.
Moran’s I is a statistical measure that assesses the degree of spatial autocorrelation in a dataset. It ranges from -1 to +1:
-
Positive Values: Indicate positive spatial autocorrelation, where similar values (e.g., high or low resale prices) are clustered together.
-
Negative Values: Indicate negative spatial autocorrelation, where dissimilar values are adjacent.
-
Value Near Zero: Suggests spatial randomness, where there is no clear pattern in the spatial distribution.
The resale data must be an sf object with a point geometry for each flat, since the spatial relationships are computed from these coordinates.
A spatial weight matrix defines the spatial relationships between points. It can be based on distance or on neighborhood contiguity. For HDB flats, a distance-based approach (e.g., nearest neighbors) is often appropriate.
When calculating k-nearest neighbors, identical points cause issues because the algorithm cannot uniquely identify the nearest neighbors if multiple points are located at the same coordinates. To reduce such duplicates, transactions that share the same flat attributes and location are aggregated by averaging their resale prices. Flats in the same block that differ in any grouping attribute still share coordinates after this step.
resale_aggregated <- resale_final %>%
group_by(flat_type, floor_area_sqm, storey_avg, remaining_lease_total_mths, geometry) %>%
summarize(
resale_price = mean(resale_price, na.rm = TRUE)
) %>%
ungroup()
write_rds(resale_aggregated, "data/HDB/rds/resale_aggregated.rds")
resale_aggregated <- read_rds("data/HDB/rds/resale_aggregated.rds")
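One way to check whether coincident points remain is to count repeated coordinate pairs and, if needed, collapse to one row per location. The sketch below uses a hypothetical data frame with plain `x` and `y` columns rather than the sf geometry column used in this analysis. Grouping by location alone is shown as an alternative for illustration and is not the grouping applied above.

```r
# Toy transactions (made-up values): three share one location
tx <- data.frame(
  x = c(100, 100, 100, 250),
  y = c(200, 200, 200, 400),
  flat_type = c("3 ROOM", "4 ROOM", "4 ROOM", "3 ROOM"),
  resale_price = c(380000, 520000, 540000, 365000)
)

# Count rows whose coordinates repeat an earlier row
n_dup_before <- sum(duplicated(tx[, c("x", "y")]))

# Collapse to one row per location by averaging the price
by_location <- aggregate(resale_price ~ x + y, data = tx, FUN = mean)
n_dup_after <- sum(duplicated(by_location[, c("x", "y")]))

print(by_location)
```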
Use Moran’s I to test for spatial autocorrelation of resale prices across the dataset.
A significant Moran’s I value (with a p-value < 0.05) indicates spatial autocorrelation, suggesting that geographically weighted models (e.g., GWR) are appropriate for this analysis.
Interpret Moran’s I Results:
-
Positive Moran’s I Value (with significant p-value): Indicates clustering of similar prices (e.g., high or low resale prices in specific areas), justifying the use of geographically weighted models to account for spatial variability.
-
Non-significant Moran’s I Value (or close to zero): Implies no spatial autocorrelation, suggesting that traditional models like Ordinary Least Squares (OLS) might be sufficient as there is no strong spatial dependency in the data.
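To show what the statistic measures, the sketch below computes Moran's I by hand in base R on a toy grid, using row-standardised k-nearest-neighbour weights. The formula is I = (n / S0) * sum over i and j of w_ij z_i z_j, divided by the sum of z_i squared, where z holds deviations from the mean and S0 is the sum of all weights. The grid, the values, and k = 4 are assumptions for illustration only. Ties between equally distant neighbours are broken by row order.

```r
# Toy data: a 5 x 5 grid where the value rises steadily from west to east
grid <- expand.grid(x = 1:5, y = 1:5)
value <- grid$x
n <- nrow(grid)
k <- 4  # number of nearest neighbours (assumed for this example)

# Pairwise Euclidean distances; a point is never its own neighbour
d <- as.matrix(dist(grid))
diag(d) <- Inf

# Row-standardised weights: each of the k nearest neighbours gets 1 / k
w <- matrix(0, n, n)
for (i in seq_len(n)) {
  w[i, order(d[i, ])[seq_len(k)]] <- 1 / k
}

# Moran's I and its expectation under spatial randomness
z <- value - mean(value)
s0 <- sum(w)
morans_i <- (n / s0) * sum(w * outer(z, z)) / sum(z^2)
expected_i <- -1 / (n - 1)

print(c(morans_i = morans_i, expected_i = expected_i))
```

Because neighbouring cells carry similar values, the statistic comes out clearly positive, well above its slightly negative expectation.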
# Convert geometry to spatial coordinates for neighborhood matrix
coords <- st_coordinates(resale_aggregated)
# Create a k-nearest neighbor structure
knn <- knearneigh(coords, k = 5) # Set k to the desired number of neighbors
nb <- knn2nb(knn)
weight_matrix <- nb2listw(nb, style = "W")
# Calculate Moran's I
moran_test <- moran.test(resale_aggregated$resale_price, weight_matrix)
write_rds(moran_test, "data/HDB/rds/moran_test.rds")
moran_test <- read_rds("data/HDB/rds/moran_test.rds")
# View results
print(moran_test)
Moran I test under randomisation
data: resale_aggregated$resale_price
weights: weight_matrix
Moran I statistic standard deviate = 172, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic Expectation Variance
6.617663e-01 -4.329942e-05 1.480471e-05
To interpret the results of Moran’s I test for spatial autocorrelation, let’s break down the key components of the output:
3.3.1 Moran’s I Statistic:
-
Value: approximately 0.662
-
Interpretation: Moran’s I typically ranges from -1 to +1. Values close to +1 indicate strong positive spatial autocorrelation, meaning similar values are clustered together in space. Values close to -1 indicate strong negative spatial autocorrelation, meaning dissimilar values are located near each other. A value near 0 suggests a random spatial pattern (no spatial autocorrelation).
-
In this case, a Moran’s I of approximately 0.662 indicates moderate to strong positive spatial autocorrelation. This suggests that higher resale prices tend to cluster with other higher resale prices, and lower prices with other lower prices, across the dataset.
3.3.2 Expectation:
-
Value: -4.329942e-05 (close to zero).
-
Interpretation: This is the expected value of Moran’s I under the null hypothesis of spatial randomness (no spatial autocorrelation). If the observed Moran’s I significantly deviates from this expectation, it suggests spatial autocorrelation.
3.3.3 Variance:
-
Value: 1.480471e-05
-
Interpretation: This is the variance of Moran’s I under the null hypothesis, used to calculate the significance of the observed statistic.
3.3.4 Standard Deviate and P-Value:
-
Standard Deviate (z-score): 171.99
-
P-Value: <2.2e-16
-
Interpretation: The very low p-value indicates that the observed Moran’s I is statistically significant, rejecting the null hypothesis of spatial randomness. This means there is strong evidence of spatial clustering in resale prices, with a pattern that is unlikely to have occurred by random chance.
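The link between the statistic, its expectation, its variance, and the standard deviate can be verified with a few lines of arithmetic. This sketch only reuses the three numbers printed by `moran.test()` above; the variable names are illustrative. The implied sample size is an inference from the printed expectation (using the null-hypothesis result E[I] = -1/(n - 1)), not a figure stated elsewhere in this analysis.

```r
# Values copied from the printed moran.test() output
moran_i     <- 6.617663e-01
expectation <- -4.329942e-05
variance    <- 1.480471e-05

# Standard deviate (z-score): how many standard errors the observed I
# lies above its expectation under spatial randomness
z_score <- (moran_i - expectation) / sqrt(variance)

# One-sided p-value for the alternative hypothesis "greater"
p_value <- pnorm(z_score, lower.tail = FALSE)

# Under the null hypothesis E[I] = -1 / (n - 1), so the expectation implies n
n_implied <- 1 - 1 / expectation

print(z_score)
print(p_value)
print(round(n_implied))
```

The z-score comes out at roughly 172, matching the printed standard deviate, and the upper-tail probability is far below the 2.2e-16 reporting threshold.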
3.3.5 Summary of Interpretation
-
Spatial Pattern: There is strong evidence of spatial autocorrelation in resale prices, indicating that properties with similar prices are geographically clustered.
-
Significance: The significant p-value (very close to zero) confirms that the observed spatial pattern is statistically meaningful.
-
Practical Implication: Spatial factors such as proximity to amenities, neighborhood quality, or other spatially dependent factors may be influencing resale prices. This supports using geographically weighted regression (GWR) or other spatial models to explore these spatial relationships further.
4 Exploratory Data Analysis (EDA)
EDA is essential to understand the dataset, identify patterns, and prepare it for modeling. The steps below guide you through performing a thorough EDA, with each component addressing different aspects of the data.
4.1 Descriptive Statistics
The first step in EDA is to generate descriptive statistics, which provide a numerical summary of key variables. By calculating metrics like the mean, median, standard deviation, minimum, and maximum for resale prices, as well as structural features like the area, floor level, remaining lease, and age of the unit, we can get a sense of the central tendencies and variability in the data.
This overview is helpful for spotting potential outliers or variations that may influence resale prices. For example, high variability in resale prices might indicate a wide range of property values across different regions and flat types. Descriptive statistics also help establish a baseline understanding of the dataset, highlighting the general characteristics of HDB flats within the specified time range.
The code chunk below uses glimpse() to display the data structure of resale_final.
glimpse(resale_final)
Rows: 23,555
Columns: 22
$ resale_price <dbl> 380000, 635000, 380000, 365000, 418000, 380…
$ town <chr> "Ang Mo Kio", "Ang Mo Kio", "Ang Mo Kio", "…
$ region <chr> "North-East", "North-East", "North-East", "…
$ flat_type <chr> "3 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", "3 …
$ flat_model <chr> "New Generation", "Model A", "New Generatio…
$ floor_area_sqm <dbl> 67, 70, 67, 73, 73, 67, 89, 68, 75, 74, 75,…
$ storey_avg <dbl> 5, 26, 8, 5, 8, 5, 8, 5, 5, 2, 2, 5, 8, 8, …
$ remaining_lease_total_mths <dbl> 649, 1065, 649, 640, 640, 643, 673, 685, 67…
$ geometry <POINT [m]> POINT (28537.68 38825.23), POINT (292…
$ proximity_to_mrt <dbl> 0.4046012, 0.7314224, 0.4046012, 0.5263600,…
$ proximity_to_goodprisch <dbl> 0.8000960, 1.0934641, 0.8000960, 1.1818905,…
$ within_1km_prisch <int> 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ proximity_to_eldercare <dbl> 0.40435866, 0.25495445, 0.40435866, 0.05809…
$ proximity_to_CHAS <dbl> 0.12666079, 0.28840296, 0.12666079, 0.06303…
$ proximity_to_spmrkt <dbl> 0.15775712, 0.31448012, 0.15775712, 0.06303…
$ proximity_to_hawker <dbl> 0.1378719, 0.3828329, 0.1378719, 0.1477741,…
$ proximity_to_parks <dbl> 0.19858307, 0.06656148, 0.19858307, 0.17137…
$ proximity_to_childcare <dbl> 1.857691e-01, 1.151054e-01, 1.857691e-01, 1…
$ within_350m_childcare <int> 2, 4, 2, 5, 5, 1, 2, 3, 3, 4, 5, 1, 3, 3, 3…
$ proximity_to_busstop <dbl> 0.16643716, 0.02802881, 0.16643716, 0.18021…
$ within_350m_busstop <int> 6, 4, 6, 4, 4, 5, 4, 10, 4, 5, 5, 4, 6, 10,…
$ proximity_to_mall <dbl> 1.0027651, 0.6647901, 1.0027651, 0.4886993,…
Next, summary() of base R is used to display the summary statistics of the resale_final tibble data frame.
summary(resale_final)
resale_price town region flat_type
Min. : 150000 Length:23555 Length:23555 Length:23555
1st Qu.: 450000 Class :character Class :character Class :character
Median : 545000 Mode :character Mode :character Mode :character
Mean : 562716
3rd Qu.: 640000
Max. :1500000
flat_model floor_area_sqm storey_avg remaining_lease_total_mths
Length:23555 Min. : 52.00 Min. : 2.00 Min. : 505.0
Class :character 1st Qu.: 75.00 1st Qu.: 5.00 1st Qu.: 730.0
Mode :character Median : 93.00 Median : 8.00 Median : 887.0
Mean : 93.37 Mean : 8.96 Mean : 889.6
3rd Qu.:110.00 3rd Qu.:11.00 3rd Qu.:1092.0
Max. :176.00 Max. :50.00 Max. :1154.0
geometry proximity_to_mrt proximity_to_goodprisch
POINT :23555 Min. :0.01463 Min. :0.04958
epsg:3414 : 0 1st Qu.:0.30168 1st Qu.:1.17410
+proj=tmer...: 0 Median :0.51494 Median :1.90283
Mean :0.58134 Mean :2.07394
3rd Qu.:0.77348 3rd Qu.:2.63467
Max. :3.49822 Max. :7.16529
within_1km_prisch proximity_to_eldercare proximity_to_CHAS proximity_to_spmrkt
Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.000 1st Qu.:0.3218 1st Qu.:0.1078 1st Qu.:0.1723
Median :0.000 Median :0.6156 Median :0.1690 Median :0.2662
Mean :0.198 Mean :0.7870 Mean :0.1844 Mean :0.2880
3rd Qu.:0.000 3rd Qu.:1.0739 3rd Qu.:0.2420 3rd Qu.:0.3787
Max. :2.000 Max. :4.7675 Max. :2.7122 Max. :3.3254
proximity_to_hawker proximity_to_parks proximity_to_childcare
Min. :0.006981 Min. :0.006039 Min. :0.00000
1st Qu.:0.351676 1st Qu.:0.303246 1st Qu.:0.07462
Median :0.628092 Median :0.493864 Median :0.11856
Mean :0.747090 Mean :0.569264 Mean :0.12951
3rd Qu.:1.007198 3rd Qu.:0.737417 3rd Qu.:0.17331
Max. :2.867630 Max. :2.066652 Max. :2.91807
within_350m_childcare proximity_to_busstop within_350m_busstop
Min. : 0.000 Min. :0.01543 Min. : 0.00
1st Qu.: 2.000 1st Qu.:0.07423 1st Qu.: 6.00
Median : 3.000 Median :0.10751 Median : 8.00
Mean : 3.695 Mean :0.11455 Mean : 7.89
3rd Qu.: 5.000 3rd Qu.:0.14592 3rd Qu.:10.00
Max. :20.000 Max. :0.39147 Max. :19.00
proximity_to_mall
Min. :0.0000
1st Qu.:0.3810
Median :0.5887
Mean :0.6486
3rd Qu.:0.8614
Max. :3.1782
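summary() reports quartiles and the mean but not the standard deviation mentioned at the start of this section. The following is a minimal base R sketch of one way to tabulate all five metrics together. It uses only the first five values of three columns shown in the glimpse() output as a toy sample, and `toy`, `describe`, and `desc_stats` are hypothetical names; on the real data the sf geometry would need to be dropped first, for example with `st_drop_geometry()`.

```r
# Toy sample: first five rows of three numeric columns from the glimpse() output
toy <- data.frame(
  resale_price   = c(380000, 635000, 380000, 365000, 418000),
  floor_area_sqm = c(67, 70, 67, 73, 73),
  storey_avg     = c(5, 26, 8, 5, 8)
)

# Summary metrics for one numeric vector
describe <- function(v) {
  c(mean = mean(v), median = median(v), sd = sd(v), min = min(v), max = max(v))
}

# One row per variable, one column per metric
desc_stats <- t(sapply(toy, describe))
print(desc_stats)
```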
4.2 Visualize Relationships
To gain insights into the relationships between resale prices and various structural and locational features, we can create scatter plots and violin plots.
Scatter plots allow us to examine how resale prices vary with continuous variables like area, age, and remaining lease. For instance, plotting resale price against area may reveal a positive trend, suggesting that larger flats generally command higher prices. Similarly, a scatter plot of resale price versus age could show whether older flats have lower resale values, providing insight into the impact of depreciation. By mapping prices geographically, we can identify areas with higher or lower resale values, giving us an understanding of spatial clustering in the data. These visualizations make it easier to observe trends and relationships that may not be apparent from summary statistics alone.
# Scatter plot: Resale price vs Area
plot_area <- ggplot(resale_final, aes(x = floor_area_sqm, y = resale_price)) +
geom_point(alpha = 0.5) +
labs(title = "Resale Price vs. Area", x = "Area (sqm)", y = "Resale Price")
ggsave("plot_resale_price_vs_area.png", plot = plot_area, width = 7, height = 5, dpi = 300)
# Scatter plot: Resale price vs Storey
plot_storey <- ggplot(resale_final, aes(x = storey_avg, y = resale_price)) +
geom_point(alpha = 0.5) +
labs(title = "Resale Price vs. Storey", x = "Storey", y = "Resale Price")
ggsave("plot_resale_price_vs_storey.png", plot = plot_storey, width = 7, height = 5, dpi = 300)
# Scatter plot: Resale price vs Remaining Lease (a proxy for flat age: shorter lease = older flat)
plot_age <- ggplot(resale_final, aes(x = remaining_lease_total_mths, y = resale_price)) +
geom_point(alpha = 0.5) +
labs(title = "Resale Price vs. Remaining Lease", x = "Remaining Lease (months)", y = "Resale Price")
ggsave("plot_resale_price_vs_age.png", plot = plot_age, width = 7, height = 5, dpi = 300)
# Scatter plot: Resale price vs Proximity to Nearest MRT exits
plot_mrt <- ggplot(resale_final, aes(x = proximity_to_mrt, y = resale_price)) +
geom_point(alpha = 0.5) +
labs(title = "Resale Price vs. Proximity to Nearest MRT exits", x = "Distance (km)", y = "Resale Price")
ggsave("plot_resale_price_vs_mrt.png", plot = plot_mrt, width = 7, height = 5, dpi = 300)
# Scatter plot: Resale price vs Proximity to Nearest Good Primary School
plot_goodprisch <- ggplot(resale_final, aes(x = proximity_to_goodprisch, y = resale_price)) +
geom_point(alpha = 0.5) +
labs(title = "Resale Price vs. Proximity to Nearest Good Primary School", x = "Distance (km)", y = "Resale Price")
ggsave("plot_resale_price_vs_goodprisch.png", plot = plot_goodprisch, width = 7, height = 5, dpi = 300)
# Scatter plot: Resale price vs No. of Good Primary School within 1km
plot_prisch <- ggplot(resale_final, aes(x = within_1km_prisch, y = resale_price)) +
geom_point(alpha = 0.5) +
labs(title = "Resale Price vs. No. of Good Primary School within 1km", x = "No. of Good Primary Schools", y = "Resale Price")
ggsave("plot_resale_price_vs_within_1km_prisch.png", plot = plot_prisch, width = 7, height = 5, dpi = 300)
# Scatter plot: Resale price vs Proximity to Nearest Eldercare
plot_eldercare <- ggplot(resale_final, aes(x = proximity_to_eldercare, y = resale_price)) +
geom_point(alpha = 0.5) +
labs(title = "Resale Price vs. Proximity to Nearest Eldercare", x = "Distance (km)", y = "Resale Price")
ggsave("plot_resale_price_vs_eldercare.png", plot = plot_eldercare, width = 7, height = 5, dpi = 300)
# Scatter plot: Resale price vs Proximity to Nearest CHAS Clinic
plot_chas <- ggplot(resale_final, aes(x = proximity_to_CHAS, y = resale_price)) +
geom_point(alpha = 0.5) +
labs(title = "Resale Price vs. Proximity to Nearest CHAS Clinic", x = "Distance (km)", y = "Resale Price")
ggsave("plot_resale_price_vs_chas.png", plot = plot_chas, width = 7, height = 5, dpi = 300)
# Scatter plot: Resale price vs Proximity to Nearest Supermarket
plot_spmrkt <- ggplot(resale_final, aes(x = proximity_to_spmrkt, y = resale_price)) +
geom_point(alpha = 0.5) +
labs(title = "Resale Price vs. Proximity to Nearest Supermarket", x = "Distance (km)", y = "Resale Price")
ggsave("plot_resale_price_vs_spmrkt.png", plot = plot_spmrkt, width = 7, height = 5, dpi = 300)
# Scatter plot: Resale price vs Proximity to Nearest Hawker
plot_hawker <- ggplot(resale_final, aes(x = proximity_to_hawker, y = resale_price)) +
geom_point(alpha = 0.5) +
labs(title = "Resale Price vs. Proximity to Nearest Hawker", x = "Distance (km)", y = "Resale Price")
ggsave("plot_resale_price_vs_hawker.png", plot = plot_hawker, width = 7, height = 5, dpi = 300)
# Scatter plot: Resale price vs Proximity to Nearest Park
plot_parks <- ggplot(resale_final, aes(x = proximity_to_parks, y = resale_price)) +
geom_point(alpha = 0.5) +
labs(title = "Resale Price vs. Proximity to Nearest Park", x = "Distance (km)", y = "Resale Price")
ggsave("plot_resale_price_vs_parks.png", plot = plot_parks, width = 7, height = 5, dpi = 300)
# Scatter plot: Resale price vs No. of Childcares within 350m
plot_childcare <- ggplot(resale_final, aes(x = within_350m_childcare, y = resale_price)) +
geom_point(alpha = 0.5) +
labs(title = "Resale Price vs. No. of Childcares within 350m", x = "No. of Childcares", y = "Resale Price")
ggsave("plot_resale_price_vs_childcare.png", plot = plot_childcare, width = 7, height = 5, dpi = 300)
# Scatter plot: Resale price vs Proximity to Nearest Bus Stop
plot_busstop <- ggplot(resale_final, aes(x = proximity_to_busstop, y = resale_price)) +
geom_point(alpha = 0.5) +
labs(title = "Resale Price vs. Proximity to Nearest Bus Stop", x = "Distance (km)", y = "Resale Price")
ggsave("plot_resale_price_vs_busstop.png", plot = plot_busstop, width = 7, height = 5, dpi = 300)
# Scatter plot: Resale price vs No. of Bus Stops within 350m
plot_within_busstop <- ggplot(resale_final, aes(x = within_350m_busstop, y = resale_price)) +
geom_point(alpha = 0.5) +
labs(title = "Resale Price vs. No. of Bus Stops within 350m", x = "No. of Bus Stops", y = "Resale Price")
ggsave("plot_resale_price_vs_within_busstop.png", plot = plot_within_busstop, width = 7, height = 5, dpi = 300)
# Scatter plot: Resale price vs Proximity to Nearest Mall
plot_mall <- ggplot(resale_final, aes(x = proximity_to_mall, y = resale_price)) +
geom_point(alpha = 0.5) +
labs(title = "Resale Price vs. Proximity to Nearest Mall", x = "Distance (km)", y = "Resale Price")
ggsave("plot_resale_price_vs_mall.png", plot = plot_mall, width = 7, height = 5, dpi = 300)
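Since the blocks above differ only in the x variable, axis label, and title, the same plots could be produced from a small specification table. This is a sketch of that alternative rather than the code used to produce the figures: `make_scatter()` and `plot_specs` are hypothetical names, only three of the variables are listed, and the toy data frame holds the first four rows shown in the glimpse() output.

```r
library(ggplot2)

# Build one scatter plot of resale price against the column named in x_var
make_scatter <- function(data, x_var, x_lab, title) {
  ggplot(data, aes(x = .data[[x_var]], y = .data[["resale_price"]])) +
    geom_point(alpha = 0.5) +
    labs(title = title, x = x_lab, y = "Resale Price")
}

# One row per plot; further variables would be added as extra rows
plot_specs <- data.frame(
  x_var = c("floor_area_sqm", "storey_avg", "proximity_to_mrt"),
  x_lab = c("Area (sqm)", "Storey", "Distance (km)"),
  title = c("Resale Price vs. Area",
            "Resale Price vs. Storey",
            "Resale Price vs. Proximity to Nearest MRT exits"),
  stringsAsFactors = FALSE
)

# Toy data: first four rows from the glimpse() output
toy <- data.frame(
  resale_price     = c(380000, 635000, 380000, 365000),
  floor_area_sqm   = c(67, 70, 67, 73),
  storey_avg       = c(5, 26, 8, 5),
  proximity_to_mrt = c(0.4046012, 0.7314224, 0.4046012, 0.5263600)
)

# Named list of ggplot objects, one per row of plot_specs
plots <- Map(
  function(x_var, x_lab, title) make_scatter(toy, x_var, x_lab, title),
  plot_specs$x_var, plot_specs$x_lab, plot_specs$title
)

# Each plot could then be saved as before, for example:
# ggsave("plot_resale_price_vs_area.png", plot = plots[["floor_area_sqm"]],
#        width = 7, height = 5, dpi = 300)
```

Keeping labels in one table makes it harder for a title and its variable to drift apart when a plot is copied and edited.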

Resale Price vs. Area (sqm): There is a strong positive relationship, with resale prices increasing as the floor area of the property increases, indicating that larger flats are valued higher.

Resale Price vs. Storey: Higher floors appear to have a slight positive effect on resale prices, possibly due to better views or reduced noise.

Resale Price vs. Age of Flat (remaining lease, in months): The x-axis is the remaining lease, so older flats sit on the left of the plot. Resale prices rise with remaining lease, indicating that newer flats are generally more desirable.

Resale Price vs. Proximity to Nearest MRT exits: Properties closer to MRT exits have higher resale prices, showing that accessibility to public transportation is a key factor in property valuation.

Resale Price vs. Proximity to Nearest Good Primary School: There is no strong correlation between proximity to a good primary school and resale prices, indicating limited impact on value.

Resale Price vs. No. of Good Primary Schools within 1km: The resale price seems to show minimal variation with the number of good primary schools within 1km, indicating that proximity to schools may not strongly influence pricing.

Resale Price vs. Proximity to Nearest Eldercare: Resale prices tend to decrease slightly as distance from the nearest eldercare facility increases, suggesting that closer proximity to eldercare may have a small positive impact on property value.

Resale Price vs. Proximity to Nearest CHAS Clinic: There is a weak negative correlation, where resale prices tend to be slightly higher when closer to CHAS clinics, though the effect appears minimal.

Resale Price vs. Proximity to Nearest Supermarket: Resale prices appear higher when closer to supermarkets, implying that proximity to essential amenities like supermarkets positively impacts housing prices.

Resale Price vs. Proximity to Nearest Hawker Center: Resale prices decrease with distance from hawker centers, suggesting that closeness to affordable food options is a desirable attribute.

Resale Price vs. Proximity to Nearest Park: Proximity to parks shows some clustering of higher resale prices near parks, but the relationship is not very strong, as there is a wide spread across various distances.

Resale Price vs. Number of Childcares within 350m: The number of nearby childcares shows a more scattered trend, with resale prices higher around lower numbers of nearby childcare facilities. This may suggest that too many nearby childcares don’t significantly enhance property value.

Resale Price vs. Proximity to Nearest Bus Stop: There is a very dense clustering of data points close to bus stops, with no strong indication of increased resale price with proximity to bus stops. The resale prices are distributed widely regardless of proximity.

Resale Price vs. Number of Bus Stops within 350m: The resale price does not show a clear trend relative to the number of nearby bus stops. Although resale prices are generally higher for areas with fewer bus stops within 350m, the spread remains wide across different bus stop counts, suggesting that proximity to multiple bus stops within a short distance may not strongly influence resale price.

Resale Price vs. Proximity to Nearest Mall: Resale prices tend to be slightly higher when closer to malls, indicating that proximity to malls may have a mild positive influence on prices. However, there’s still a significant spread across all distances.
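Visual impressions such as "strong positive" or "weak negative" can be backed by correlation coefficients. The sketch below shows the mechanics on synthetic data only: the data frame `toy` is simulated with an assumed positive area effect and an assumed negative MRT-distance effect, so its coefficients say nothing about the real HDB data. On the actual dataset the same `cor()` call would be applied to the numeric columns of `resale_final` after dropping the geometry.

```r
set.seed(2024)
n <- 200

# Synthetic predictors, drawn within the ranges reported by summary()
toy <- data.frame(
  floor_area_sqm   = runif(n, 52, 176),
  proximity_to_mrt = runif(n, 0.01, 3.5)
)

# Synthetic prices (assumed effects, for illustration only)
toy$resale_price <- 200000 +
  4000 * toy$floor_area_sqm -
  80000 * toy$proximity_to_mrt +
  rnorm(n, sd = 50000)

# Pearson correlation of each predictor with resale price
predictors <- c("floor_area_sqm", "proximity_to_mrt")
cors <- sapply(predictors, function(v) cor(toy[[v]], toy$resale_price))

print(round(cors, 2))
```

A coefficient near zero would support a "no clear trend" reading, while the sign shows whether prices rise or fall with the variable. Pearson correlation only captures linear association, so `method = "spearman"` may be a better fit for skewed distance variables.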
A violin plot is also an excellent choice for visualizing the distribution of resale prices across different regions and flat models because it combines elements of both a box plot and a kernel density plot. Here are specific reasons why a violin plot is suitable:
-
Distribution Insight: Violin plots provide a clear view of the distribution’s shape for each category, revealing where resale prices are concentrated and if there are multiple modes (peaks) in the data.
-
Comparison Across Categories: By plotting different regions and flat models side-by-side, we can easily compare the price distributions across categories and detect variations in spread, skewness, or the presence of outliers.
-
Density Information: Unlike box plots, violin plots show the density of the data at different price levels. This is helpful in understanding if resale prices cluster around certain values or if they are more uniformly distributed.
We’ll analyse 4-room flats specifically based on their popularity and representativeness in Singapore’s housing market. Four-room flats are among the most common flat types in Singapore’s HDB system, making them a relevant indicator of general market trends. By focusing on this category, we can gain insights into typical resale prices across regions while comparing different flat models. This narrower focus allows for more meaningful and consistent comparisons without the variability introduced by other flat types, which may have different price dynamics due to their size and demand characteristics.
# Filter for 4-room flats only
filtered_data <- resale_tidy %>%
  filter(flat_type == "4 ROOM")
# Create the violin plot with facets for each region, using color for flat model
plot <- ggplot(filtered_data, aes(x = "", y = resale_price, fill = flat_model)) +
  geom_violin(trim = FALSE) +
  facet_wrap(~ region, scales = "free", ncol = 2) + # Arrange facets in two columns
  labs(title = "Resale Prices of 4-Room Flats by Flat Model in Each Region",
       x = "Region", y = "Resale Price") +
  scale_y_continuous(labels = scales::dollar_format()) +
  theme_minimal() +
  theme(axis.text.x = element_blank(), # Remove x-axis text for readability
        axis.title.x = element_blank()) +
  guides(fill = guide_legend(title = "Flat Model"))
# Save the plot as an image file
ggsave("resale_prices_violin_plot.png", plot = plot, width = 10, height = 8)

The violin plot above provides insights into the distribution of resale prices for 4-room flats across various regions in Singapore, segmented by flat model. Here are some key observations:
-
Regional Price Variability:
-
Central Region: The resale prices in the Central region have the highest variability, with several flat models (such as Premium Apartment, DBSS) reaching close to $1,500,000. This reflects the higher demand and premium associated with centrally located flats in Singapore.
-
North-East Region: Prices here are generally clustered within a narrower range between $500,000 and $800,000, with DBSS and Premium Apartment models showing higher resale values.
-
East Region: Similar to the North-East, the East region shows moderate resale values between $500,000 and $800,000, with DBSS flats on the higher end.
-
North and West Regions: These regions display the lowest price variability, with most models staying between $400,000 and $800,000. In the North, flat models like DBSS and Premium Apartment tend to show slightly higher resale values, though not as high as those in the Central or East regions.
-
Flat Model Influence on Prices:
-
DBSS (Design, Build, and Sell Scheme) and Premium Apartment models generally fetch higher resale prices across all regions. This is likely due to the enhanced design, additional amenities, and higher quality associated with these models.
-
Standard and Simplified models tend to have lower resale prices, indicating they are more basic options with fewer amenities, and thus appeal to buyers looking for affordability over additional features.
-
New Generation and Model A are popular across regions but usually fall into a mid-range price point, indicating a balanced offering of quality and affordability.
-
Distribution and Density:
-
Each violin plot represents the density of resale prices for a given flat model within a region. Wider areas in the violin plot indicate a higher concentration of resale transactions at a specific price point.
-
For example, in the Central region, Premium Apartment flats have a wide spread, suggesting that resale prices vary considerably rather than concentrating around a single value. In the North-East region, by contrast, DBSS flats have a narrower spread, indicating that resale prices are concentrated within a smaller range.
-
Insights by Region:
-
Central Region: Dominated by higher-value flats, with some models displaying very high resale prices due to the prime location.
-
East and North-East Regions: These regions offer a good mix of mid-to-high resale values, particularly for DBSS and Premium Apartment flats.
-
North and West Regions: These are more affordable regions, with flat models generally priced lower compared to the Central area.
In summary, the Central region commands the highest prices, likely due to its proximity to the city center, while the other regions offer a range of mid- to high-priced flats, depending on the model. DBSS and Premium Apartments consistently have higher resale prices across all regions, indicating their premium value in the market. This plot helps identify patterns in resale prices by region and flat model, which could be valuable for buyers, investors, and policymakers.
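The spread comparisons read off the violin plot could also be quantified with a median and interquartile range per region and flat model. The following base R sketch shows the mechanics on synthetic data: every value in `toy` is simulated under assumed means and spreads, and `spread` is an illustrative name. It does not reproduce or confirm the figures discussed above.

```r
set.seed(42)

# Synthetic prices for two regions and two flat models (illustration only)
toy <- data.frame(
  region     = rep(c("Central", "North"), each = 40),
  flat_model = rep(rep(c("Model A", "DBSS"), each = 20), times = 2),
  resale_price = c(
    rnorm(20, mean = 700000, sd = 120000),
    rnorm(20, mean = 950000, sd = 150000),
    rnorm(20, mean = 480000, sd = 40000),
    rnorm(20, mean = 600000, sd = 50000)
  )
)

# Median and interquartile range for each region x flat model combination
spread <- aggregate(
  resale_price ~ region + flat_model,
  data = toy,
  FUN = function(p) c(median = median(p), iqr = IQR(p))
)

# aggregate() returns a matrix column; flatten it into ordinary columns
spread <- do.call(data.frame, spread)
print(spread)
```

A larger interquartile range corresponds to a taller, more stretched violin, while a smaller one corresponds to prices packed into a narrow band.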
4.3 Spatial Distribution Analysis
A spatial distribution analysis allows us to understand how resale prices are geographically distributed across different regions of Singapore. Because the dataset is an sf object with point geometries, we can plot each transaction as a dot on an interactive map and color it by resale price. Using color gradients to represent different price levels, we can visually identify high-value and low-value clusters across Singapore.
This approach highlights specific regions where resale prices are consistently higher or lower, such as areas near central business districts or regions with greater access to amenities. Spatial distribution analysis is particularly useful for understanding location-based trends and identifying regions that may warrant further investigation or targeted modeling techniques.
# Visualize spatial distribution of resale prices
tmap_mode("view")
tmap mode set to interactive viewing
tm_shape(resale_final) +
tm_dots(col = "resale_price", palette = "YlOrRd", title = "HDB Resale Prices") +
tm_layout(title = "Spatial Distribution of HDB Resale Prices")